The bias/variance tradeoff is fundamental to learning: increasing a model's complexity can improve
its fit on training data, but potentially worsens performance on future samples [1]. Remarkably,
however, the human brain effortlessly handles a wide range of complex pattern recognition tasks.
On the basis of these conflicting observations, it has been argued that useful biases in the form of
"generic mechanisms for representation" must be hardwired into cortex [2].

This note describes a useful bias that encourages cooperative learning and is both biologically
plausible and rigorously justified [3, 4].

Let us outline the problem. Neurons learn inductively. They generalize from finite samples and
encode estimates of future outcomes (for example, rewards) into their spiketrains [10]. Results from
learning theory imply that generalizing successfully requires strong biases [1] or, in other words,
specialization. Thus, at any given time some neurons' specialties are more relevant than others.
Since most of the data neurons receive are other neurons' outputs, it is essential that neurons indicate 
which of their outputs encode high quality estimates. Downstream neurons should then be biased to 
specialize on these outputs. 

The proposed biasing mechanism is based on a constraint on the effective information, ei, generated
by spikes; see Eq. (*) below. The motivation for using effective information comes from a
connection to learning theory explained in §2. There, we show that the ei generated by empirical risk
minimization quantifies capacity: higher ei yields tighter generalization bounds.

Sections §3 and §4 consider implications of the constraint in two cases: abstractly and for a concrete
model. In both cases we find that imposing constraint (*) implies: (i) essentially all information is
carried by spikes; (ii) spikes encode reward estimates; and (iii) the higher the effective information,
the better the guarantees on estimates.

Although the proposal is inspired by cortical learning, the main ideas are information-theoretic, 
suggesting they may also apply to other examples of interacting populations of adaptive agents. 



1 Information 



We model physical systems as input/output devices. For simplicity, we require that inputs $X$ and
outputs $Y$ form finite sets. Systems are not necessarily deterministic. We encode the probability that
system $m$ outputs $y \in Y$ given input $x \in X$ in Markov matrix $P_m(y \mid x)$.

Consider two perspectives on a physical system. The first is computational: the system receives
an input, and we ask what output it will produce according to the probabilistic or deterministic
rule encoded in $P_m$. The second is inferential: the system produces an output, and we ask what
information it generates about its input.

The inferential perspective is based on the notion of Bayesian information gain. Suppose we have
a model $P_M(d \mid h)$ that specifies the probability of observing data given a hypothesis, and also a prior
distribution $P(h)$ on hypotheses. If we observe data $d$, how much have we learned about the
hypotheses? The Bayesian information gain is

\[
D\big[P_M(H \mid d) \,\big\|\, P(H)\big], \quad \text{where} \quad D[P \,\|\, Q] := \sum_i P_i \log_2 \frac{P_i}{Q_i} \tag{1}
\]

is the Kullback-Leibler divergence and $P_M(h \mid d)$ is computed via Bayes' rule.
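
As a concrete illustration of Eq. (1), the following Python sketch computes the Bayesian information gain for a toy model. The two-hypothesis likelihood table, the function names `kl_bits` and `information_gain`, and the use of NumPy are assumptions made for illustration; none of them come from the text.

```python
import numpy as np

def kl_bits(p, q):
    """Kullback-Leibler divergence D[p || q] in bits.

    Terms with p_i = 0 contribute nothing; q must be positive wherever p is.
    """
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    mask = p > 0
    return float(np.sum(p[mask] * np.log2(p[mask] / q[mask])))

def information_gain(likelihood, prior, d):
    """Bayesian information gain D[P(H | d) || P(H)] for observed data index d.

    likelihood[d, h] = P(d | h) and prior[h] = P(h).
    """
    likelihood = np.asarray(likelihood, dtype=float)
    prior = np.asarray(prior, dtype=float)
    joint = likelihood[d] * prior        # P(d | h) P(h) for each hypothesis h
    posterior = joint / joint.sum()      # Bayes' rule
    return kl_bits(posterior, prior)

# Hypothetical toy model: two hypotheses, two possible observations.
likelihood = np.array([[0.9, 0.2],    # P(d=0 | h=0), P(d=0 | h=1)
                       [0.1, 0.8]])   # P(d=1 | h=0), P(d=1 | h=1)
prior = np.array([0.5, 0.5])
gain = information_gain(likelihood, prior, d=1)
```

Observing `d=1` shifts the posterior strongly towards the second hypothesis, so the gain is positive; an observation that is equally likely under every hypothesis would yield a gain of zero.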

Effective information is simply¹ the Bayesian information gain of physical system $m$, where inputs
and outputs correspond to hypotheses and data respectively [3]. In the absence of additional
constraints, we place the uniform (maximum entropy) prior on inputs. The effective information about
$X$ generated by $m$ when it outputs $y$ is

\[
ei(m, y) := D\big[P_m(X \mid y) \,\big\|\, P_{\mathrm{unif}}(X)\big]. \tag{2}
\]

Effective information quantifies the distinctions in the input that are "visible" to $m$, insofar as they
make a difference to its output.

Finally, observe that effective information quantifies selectivity: 

\[
ei(m, y) = \underbrace{H[P_{\mathrm{unif}}(X)]}_{\text{total bits available}} - \underbrace{H[P_m(X \mid y)]}_{\text{bits indistinguishable to } m} = \big\{\text{selectivity of response } y \text{ by } m\big\}, \tag{3}
\]

where $H[\cdot]$ denotes entropy. Outputs generating higher effective information therefore trace back
to more specific causes (i.e. more concentrated posteriors).
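
The definition of ei(m, y) and its reading as selectivity can be checked numerically. The sketch below assumes a system stored as a NumPy array `P[y, x]` holding the probability of output y given input x; the function names and the 8-input deterministic example are hypothetical choices made for illustration.

```python
import numpy as np

def entropy_bits(p):
    """Shannon entropy in bits, ignoring zero-probability entries."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

def posterior_given_output(P, y):
    """Posterior over inputs given output y, under the uniform prior on inputs.

    P[y, x] is the probability of output y given input x. The output y must
    occur for at least one input, otherwise the posterior is undefined.
    """
    row = np.asarray(P, dtype=float)[y]
    return row / row.sum()

def effective_information(P, y):
    """ei(m, y): KL divergence (bits) from the uniform prior to the posterior."""
    n_inputs = np.asarray(P).shape[1]
    post = posterior_given_output(P, y)
    mask = post > 0
    return float(np.sum(post[mask] * np.log2(post[mask] * n_inputs)))

# Hypothetical deterministic system: 8 inputs, output 1 only for input 0.
P = np.zeros((2, 8))
P[1, 0] = 1.0
P[0, 1:] = 1.0

ei_rare = effective_information(P, y=1)     # output traces back to one input
ei_common = effective_information(P, y=0)   # output traces back to seven inputs
selectivity = np.log2(P.shape[1]) - entropy_bits(posterior_given_output(P, 1))
```

The rare output pins down a single input and so generates all 3 available bits, whereas the common output leaves seven inputs indistinguishable and generates only a fraction of a bit.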



2 Learning 

The information one system generates about another should have implications for their future
interactions. We show this holds for the well-studied special case of empirical risk minimization
(ERM). Results are taken from [4], which should be consulted for details.

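Given function class $\mathcal{F} \subset \Sigma_X = \{\sigma : X \to \pm 1\}$ and unlabelled data $\mathcal{D} = (x_1, \ldots, x_\ell) \in X^\ell$,
ERM takes labelings of $\mathcal{D}$ to the empirical error of the best-performing classifier in $\mathcal{F}$:

\[
\mathcal{E}_{\mathcal{F},\mathcal{D}} : \Sigma_X \to \mathbb{R} : \sigma \mapsto \min_{f \in \mathcal{F}} \frac{1}{\ell} \sum_{i=1}^{\ell} \mathbb{1}_{f(x_i) \neq \sigma(x_i)}. \tag{4}
\]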

It is easy to show that $ei(\mathcal{E}_{\mathcal{F},\mathcal{D}}, 0) = \ell - VC_{\mathcal{F}}(\mathcal{D})$, where $VC_{\mathcal{F}}(\mathcal{D})$ is the empirical VC-entropy
[4, 12]. It follows that, with high probability,

\[
\text{expected error} \;\le\; \text{training error} \;+\; \underbrace{\big\langle \text{term that decreases as } ei(\mathcal{E}_{\mathcal{F},\mathcal{D}}, 0)/\ell \text{ increases} \big\rangle}_{\text{effective information}} \;+\; \langle \text{confidence term} \rangle. \tag{5}
\]

Effective information answers the question: To what extent does the error take the form it does
because of the supervisor? Note that the optimal classifier does not depend on the supervisor
("environment") directly, but rather on the supervisor factored through $\mathcal{E}_{\mathcal{F},\mathcal{D}}$, which outputs the error:

\[
\mathcal{L}_{\mathcal{F},\mathcal{D}} : \Sigma_X \to \mathcal{F} : \sigma \mapsto \arg\min_{f \in \mathcal{F}} \frac{1}{\ell} \sum_{i=1}^{\ell} \mathbb{1}_{f(x_i) \neq \sigma(x_i)} =: \hat{f}. \tag{6}
\]



A learning algorithm that shatters the data achieves zero empirical error for all supervisors, which
implies the error is independent of the supervisor. Equivalently, it implies the error achieved by
ERM generates no information about the supervisor. Increasing the effective information generated
by $\mathcal{E}_{\mathcal{F},\mathcal{D}}$ progressively concentrates the distribution of likely future errors. Guarantees tighten as ei
increases.
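
The relationship between zero training error and effective information can be verified by brute force on a tiny example. The sketch below enumerates all labelings (supervisors) of a handful of points, places the uniform prior on them, and counts how many are fit perfectly by a function class. It assumes the empirical VC-entropy is the base-2 logarithm of the number of labelings the class realizes on the data; the threshold class and all function names are illustrative assumptions.

```python
import itertools
import math

def erm_error(labeling, classifiers):
    """Empirical error of the best classifier on one labeling of the data.

    Each classifier is given as a tuple of its +1/-1 predictions on the data.
    """
    n = len(labeling)
    return min(
        sum(f_i != s_i for f_i, s_i in zip(f, labeling)) / n
        for f in classifiers
    )

def ei_of_zero_error(classifiers, n_points):
    """Effective information generated when ERM outputs zero error.

    With a uniform prior over all 2^n labelings and a deterministic ERM map,
    the posterior is uniform over the labelings achieving zero error.
    """
    labelings = itertools.product((-1, 1), repeat=n_points)
    hits = sum(1 for s in labelings if erm_error(s, classifiers) == 0.0)
    return n_points - math.log2(hits)    # log2(2^n / hits)

def threshold_class(n_points):
    """The n + 1 labelings produced by one-sided thresholds on ordered points."""
    return [tuple(1 if i >= t else -1 for i in range(n_points))
            for t in range(n_points + 1)]

def shattering_class(n_points):
    """A class rich enough to realize every labeling of the points."""
    return list(itertools.product((-1, 1), repeat=n_points))

ei_thresholds = ei_of_zero_error(threshold_class(4), 4)
ei_shattering = ei_of_zero_error(shattering_class(4), 4)
```

The restricted threshold class generates a positive amount of information when it fits the data, while the shattering class fits every supervisor and therefore generates none, in line with the paragraph above.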

An empirical risk minimizer's training error is a meaningful indicator of future performance to 
the extent that it generates high effective information. This motivates investigating whether the 
effective information generated when performing more biologically plausible optimizations also has 
implications for future performance. 



¹ In more general settings, effective information makes essential use of Pearl's interventional calculus [11].

3 Cooperative learning in abstract



Organisms aim to choose beneficial actions in the situations they encounter. Responsibility for mak- 
ing choices falls largely on the cortex. However, cortical neurons interact extraordinarily indirectly 
with the external environment - their inputs and outputs are mediated through millions of other 
neurons that constantly rewire themselves. It follows that, at best, neurons can encode provisional 
estimates of expected outcomes into their spiketrains. 

Guaranteeing and highlighting high quality estimates of future outcomes is therefore essential. In 
particular, since interneuronal communication is dominated by spikes, it is necessary that spikes 
provide meaningful indicators of future outcomes. 

We propose an information-theoretic constraint that simultaneously guarantees and highlights
estimates in populations of interacting learners (§3). The next section considers a particular model of
cortical neurons in more detail.

Let us model abstract learners as adaptable channels $X \xrightarrow{m} Y$ with two outputs, suggestively called
spikes $y_1$ and silences $y_0$. Impose the following

constraint: $ei(m, y_1) = \lambda$ bits, where $\lambda \gg ei(m, y_0)$. (*)

Parameter $\lambda$ controls the fraction of inputs causing the learner to produce $y_1$ (the higher $\lambda$, the
smaller the fraction). It makes sense that each learner in a large population should specialize on a
small fraction of possible inputs. Indeed, increasing $\lambda$ decreases the frequency of $y_1$, so
$P_m(y_1) \ll P_m(y_0)$.

Consequence 1: Spikes carry (essentially all) information. Under constraint (*), the information
transferred by learners is carried by spikes alone:

\[
I(X; Y) = P_m(y_1) \cdot ei(m, y_1) + O\big(P_m(y_1)^2\big) \approx P_m(y_1) \cdot ei(m, y_1) \tag{7}
\]

to a first-order approximation [7]. It follows that silent learners can be ignored, despite the fact that
they typically constitute the bulk of the population's responses (recall: $P_m(y_1) \ll P_m(y_0)$).
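
To make the roles of the spike and silence terms tangible, the sketch below splits the mutual information of a toy learner into the contribution of each output, using the identity that I(X; Y) is the average of ei(m, y) over outputs when the prior on inputs is uniform. The learner is assumed to be deterministic, spiking on `n_spiking` out of `n_inputs` equally likely inputs; this model and the function names are illustrative assumptions, not the setting analyzed in [7].

```python
import math

def ei_spike(n_inputs, n_spiking):
    """ei of a spike: the posterior is uniform over the inputs that cause it."""
    return math.log2(n_inputs / n_spiking)

def ei_silence(n_inputs, n_spiking):
    """ei of a silence: the posterior is uniform over all remaining inputs."""
    return math.log2(n_inputs / (n_inputs - n_spiking))

def mutual_information_terms(n_inputs, n_spiking):
    """Return (spike term, silence term); their sum is I(X; Y)."""
    p_spike = n_spiking / n_inputs
    spike_term = p_spike * ei_spike(n_inputs, n_spiking)
    silence_term = (1.0 - p_spike) * ei_silence(n_inputs, n_spiking)
    return spike_term, silence_term

# A learner that spikes on 4 of 1024 inputs, i.e. on a fraction 2**-8 of them.
lam = ei_spike(1024, 4)
spike_term, silence_term = mutual_information_terms(1024, 4)
spike_share = spike_term / (spike_term + silence_term)
```

In this toy case the spike frequency is 2 raised to the power of minus the spike's ei, so raising the constraint parameter makes spikes rarer. The silence term is small but not zero, and its size relative to the spike term shrinks as the spike's ei grows.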

This suggests a principled way to distribute credit [6]. When a positive/negative global signal is
released, the few learners with informative responses - that trace back to specific stimuli, recall
Eq. (3) - should reinforce/weaken their behaviors. Conversely, the many learners producing uninformative
responses that do not trace back to specific stimuli should not modify their synapses. It turns out
this is exactly how neurons modify their synapses, see [4].
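
The credit-distribution idea can be sketched as an update that is gated by spikes: a global signal is broadcast to every learner, but only learners that spiked change their weights. The update rule, the learning rate, and the function name `gated_update` below are hypothetical and chosen for illustration only; this is not the rule derived in the cited work.

```python
import numpy as np

def gated_update(weights, inputs, spiked, global_signal, lr=0.1):
    """Apply a global reinforcement signal only to learners that spiked.

    weights:       array of shape (n_learners, n_inputs)
    inputs:        array of shape (n_inputs,), the activity on each synapse
    spiked:        boolean array of shape (n_learners,)
    global_signal: +1 to reinforce, -1 to weaken
    """
    weights = np.asarray(weights, dtype=float)
    inputs = np.asarray(inputs, dtype=float)
    gate = np.asarray(spiked, dtype=float)           # 1 for spikes, 0 for silences
    delta = lr * global_signal * np.outer(gate, inputs)
    return weights + delta

# Three learners, four synapses each; only the first learner spiked.
weights = np.zeros((3, 4))
inputs = np.array([1.0, 0.0, 1.0, 0.0])
spiked = np.array([True, False, False])
updated = gated_update(weights, inputs, spiked, global_signal=+1)
```

Silent learners keep their weights unchanged whatever the global signal is, so credit or blame lands only on the few learners whose responses were informative.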

Consequence 2: Spikes encode high quality reward estimates. Under constraint (*), the effective
information generated by reward-maximizing² learners essentially equals their empirical relative
reward [7]. Thus, outputs with high ei indicate high empirical reward relative to alternatives.

Furthermore, constraint (*) ensures that the information encoded in spikes is reliable. Applying a
PAC-Bayes inspired variant of Ockham's razor [13], it can be shown that the higher the effective
information generated by spikes, the smaller the difference between the empirical reward estimate
$\hat{R}_m$ and expected reward $R_m$:

\[
\underbrace{\big|\hat{R}_m - R_m\big|}_{\text{accuracy of reward estimate}} \;\le\; \underbrace{\big\langle \text{term in } ei(m, y_1) \big\rangle}_{\text{term that decreases as } ei \text{ increases}} \;+\; \langle \text{confidence term} \rangle. \tag{8}
\]

Summary. The information-theoretic constraint (*) introduces an asymmetry into outputs and ensures
that spikes carry reliable information about future rewards.

4 Cooperative learning in cortex 

Spike-timing dependent plasticity (STDP) is a standard and frequently extended model of plasticity
[16]. Unfortunately, it operates in continuous time and is difficult to analyze using standard
learning-theoretic techniques. Recently [5], we investigated the fast-time constant limit of STDP. Taking the
limit strips out the exponential discount factors, essentially reducing STDP to a simpler discrete time
algorithm whilst preserving its essential structure.

² Formalized as a constrained optimization following the free-energy approaches developed in [13, 14].

We recall STDP. Given a presynaptic spike at time $t_j$ and a postsynaptic spike at time $t_k$, the strength of
synapse $j \to k$ is updated according to



STDP modifies synapses when input and output spikes co-occur in a short time window, so that
spikes gate learning. This makes sense if spikes are selective, and so by Eq. (7) carry most of the
information in cortex.
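
For readers who want a concrete rule to hold in mind, the sketch below implements a standard pair-based form of STDP from the literature, with exponentially discounted potentiation and depression. It is not necessarily the exact rule the author has in mind, and the amplitudes `a_plus`, `a_minus` and time constants `tau_plus`, `tau_minus` are placeholder values assumed for illustration.

```python
import math

def stdp_delta(t_pre, t_post, a_plus=0.01, a_minus=0.012,
               tau_plus=20.0, tau_minus=20.0):
    """Weight change for one pre/post spike pair (times in milliseconds).

    If the presynaptic spike precedes (or coincides with) the postsynaptic
    spike the synapse is potentiated, otherwise it is depressed. The size of
    the change decays exponentially with the time difference.
    """
    dt = t_post - t_pre
    if dt >= 0:
        return a_plus * math.exp(-dt / tau_plus)
    return -a_minus * math.exp(dt / tau_minus)

potentiation = stdp_delta(t_pre=0.0, t_post=5.0)   # pre before post
depression = stdp_delta(t_pre=5.0, t_post=0.0)     # post before pre
far_apart = stdp_delta(t_pre=0.0, t_post=100.0)    # outside the short window
```

Spike pairs separated by much more than the time constants produce almost no change, which is the sense in which only spikes that co-occur in a short window gate learning.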

Synaptic weights control expected error. If STDP incorporates neuromodulatory signals then it
can be shown to encode reward estimates into spikes in the fast-time constant limit $\tau \to 0$, see [5].
The accuracy of these estimates is controlled by synaptic weights.

More precisely, let $\nu : X \to \mathbb{R}$ be a random variable representing neuromodulatory signals; $l$ denote
the error, i.e. whether a spike is followed by a negative neuromodulatory signal; and $R$ the empirical
reward after spiking [5]. The expected error is controlled by the sum $\omega$ of neuron $m$'s synaptic
weights and $m$'s empirical reward:

\[
\text{expected error} \;\le\; c_1 \cdot \omega^2 - R(x, m, \nu) + \omega \cdot \langle \text{capacity term} \rangle + \langle \text{confidence term} \rangle. \tag{10}
\]



Thus, the lower synaptic weights $\omega$, the better the quality of spikes as indicators of future outcomes.

Synaptic weights are homeostatically regulated. Spikes are both metabolically expensive [17]
and selective: they typically occur in response to specific stimuli - e.g. an edge or a familiar face
[18]. Using spikes selectively reduces metabolic expenditures, which is important since the brain
accounts for a disproportionate fraction of the body's total energy budget [19].

There is evidence that synaptic strengths increase on average during wakefulness and are downscaled
during sleep [19, 20, 21]. This suggests that homeostatic regulation during sleep may both reduce
metabolic costs and simultaneously improve guarantees on reward estimates. Finally, it is easy to
show that decreasing synaptic weights increases the effective information generated by spikes.
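
The last claim can be illustrated with a toy threshold neuron. The sketch below assumes a deterministic unit with binary inputs that spikes when the weighted input sum reaches a fixed threshold, and a uniform prior over input patterns; this simplified model and the function name `spike_ei` are assumptions for illustration and are not the neuron model analyzed in [5].

```python
import itertools
import math

import numpy as np

def spike_ei(weights, threshold):
    """Effective information (bits) generated by a spike of a threshold unit.

    The unit spikes when dot(weights, x) >= threshold for binary input x.
    Under the uniform prior, the posterior given a spike is uniform over the
    input patterns that cause one, so ei = n - log2(number of such patterns).
    """
    weights = np.asarray(weights, dtype=float)
    n = len(weights)
    n_spiking = sum(
        1 for x in itertools.product((0, 1), repeat=n)
        if float(np.dot(weights, x)) >= threshold
    )
    if n_spiking == 0:
        raise ValueError("the unit never spikes, so ei of a spike is undefined")
    return n - math.log2(n_spiking)

# Same threshold, weights scaled down from 1.0 to 0.6 on each of 8 synapses.
ei_strong = spike_ei([1.0] * 8, threshold=4.0)
ei_downscaled = spike_ei([0.6] * 8, threshold=4.0)
```

With downscaled weights fewer input patterns can drive the unit over threshold, so each spike traces back to a smaller set of causes and generates more effective information.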

5 Discussion 

There is an interesting analogy between spikes and paper currency. Money plays many overlapping 
roles in an economy, including: (i) focusing attention; (ii) stimulating activity; and (iii) providing a 
quantitative lingua franca for tracking revenues and expenditures. 

Spikes may play similar roles in cortex. Spikes focus attention: STDP and other proposed learning 
rules are particularly sensitive to spikes and spike timing. Spikes stimulate activity: input spikes 
cause output spikes. Finally, spikes leave trails of (calcium) traces that are used to reinforce and
discourage neuronal behaviors in response to neuromodulatory signals.

Neither money nor spikes are intrinsically valuable. Currency can be devalued by inflation. Similarly,
the information content and guarantees associated with spikes are eroded by overpotentiating
synapses, which reduces their selectivity (potentially leading to epileptic seizures in extreme cases).
Regulating the information content of spikes is therefore essential. Eq. (*) provides a simple
constraint that can be approximately imposed by regulating synaptic weights. Indeed, there is evidence
that one of the functions of sleep is precisely this.

Spikes with high information content are valuable because they come with strong guarantees on 
their estimates. They are therefore worth paying attention to, worth responding to, worth keeping 
track of, and worth learning from. 

Acknowledgements. I thank Michel Besserve, Samory Kpotufe, Pedro Ortega for useful discus- 
sions. 